The radius $r$ must fall between the minimum and maximum distances between
the objects. The larger its value, the smaller the number of clusters. Other
criteria may be needed to select the most appropriate value (e.g., a prior
estimate of the likely number of clusters). The method of dynamic kernels is
analogous to hypersphere clustering.
The $K$-means method. This method originated from the so-called iterative self-
organizing data analysis (ISODATA) technique. The centres of $K$ clusters are chosen
simultaneously. Denoting the centre of the $k$th cluster by $Z_k$, $k = \overline{1, K}$, then for the
process of cluster formation, in particular for the incorporation of any object $X$ into
cluster $C_k$, we have
$$X \in C_k \quad \text{if} \quad \rho(X; Z_k) \le \rho(X; Z_i), \tag{13.5}$$
where $k = \overline{1, K}$, $i \ne k$. In the next step, new centres of gravity for the $K$ subclusters
are computed. At step $l$, for each new division $D_l$ the functional $F(D_l)$ is computed
by the expression
$$F(D_l) = \sum_{X \in C_{kl}} (X - Z_{kl})^2. \tag{13.6}$$
The optimal division is that for which the functional $F$ takes its minimal value. The
process of dividing continues until, for the centres of two successive steps, the condition
$$Z_{k,l+1} = Z_{kl} \tag{13.7}$$
is satisfied. The effectiveness of this algorithm depends on the chosen value of $K$,
the selection of the initial clustering centres, and the actual location of the points in
feature space corresponding to the objects, which together constitute a significant
weakness of this method.
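To make the iteration of (13.5)–(13.7) concrete, here is a minimal sketch in Python, assuming Euclidean distance as $\rho$ and initial centres drawn at random from the objects (the function names and parameters are illustrative, not from the text; the functional of (13.6) is read as accumulated over all $K$ subclusters of the division):

```python
import numpy as np

def k_means(X, K, rng=None):
    """Minimal K-means iteration: assign objects by (13.5), recompute the
    centres of gravity, and stop when two successive sets of centres
    coincide, as in (13.7). X is an (n_objects, n_features) array."""
    rng = np.random.default_rng(rng)
    # Choose K initial centres simultaneously, here by sampling K objects.
    Z = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    while True:
        # (13.5): incorporate each object into the cluster whose centre
        # is nearest to it.
        d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute the centres of gravity of the K subclusters
        # (an empty subcluster keeps its previous centre).
        Z_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else Z[k] for k in range(K)])
        # (13.7): stop when Z_{k,l+1} = Z_{kl} for every k.
        if np.allclose(Z_new, Z):
            return labels, Z_new
        Z = Z_new

def F(X, labels, Z):
    """The within-cluster functional of (13.6), accumulated over the
    K subclusters of the division."""
    return sum(((X[labels == k] - Z[k]) ** 2).sum() for k in range(len(Z)))
```

On data with well-separated groups this converges in a few iterations; but, as noted above, the result depends on the chosen $K$ and on the initial centres.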
Distance metrics. The calculation of a distance between any two objects is funda-
mental to clustering. In Euclidean space, the operation is intuitively straightforward,
especially when the positions of each object in space are represented using Cartesian
coordinates. Thus, in one dimension, the distance between two objects at positions
$x_1$ and $x_2$ is simply their difference, $|x_1 - x_2|$. The procedure is generalized to higher
dimensions using familiar knowledge of coördinate geometry (Pythagoras’ theorem);
thus, for two orthogonal axes $x$ and $y$, the distance is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
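As a concrete illustration, a sketch of this distance in Python, generalized to any number of orthogonal axes (the function name is illustrative):

```python
import math

def euclidean_distance(p, q):
    """Distance between two points given as equal-length coordinate tuples:
    sqrt((p1-q1)^2 + (p2-q2)^2 + ...), i.e. Pythagoras' theorem."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((1.0,), (4.0,)))          # one dimension: |1 - 4| = 3.0
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # two dimensions: 5.0
```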
The space must be chosen according to relevance. Thus, a collection of trees might be
characterized by height and the mean rate of photosynthesis per unit area of leaf.
Each member of the collection (set) would correspond to a point in this space. An
explicit procedure must be provided for assigning numerical values to these two
parameters. Ex hypothesi, they are considered to be independent; hence, the axes are
orthogonal. Especially when the number of dimensions of the chosen space is high,
it is often desirable to reduce it to two, because of the convenience of representing
data as a two-dimensional picture. For this purpose, principal component analysis,
described in the next section (Sect. 13.2.2), is a useful method.
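Ahead of that description, a minimal sketch of such a reduction to two dimensions, assuming the standard eigendecomposition route to the principal components (the names are illustrative):

```python
import numpy as np

def reduce_to_2d(X):
    """Project an (n_objects, n_features) array onto its first two
    principal components, for display as a two-dimensional picture."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
    return Xc @ top2                        # coordinates in the 2-D plot
```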